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Abstract 

Background: In Gram-negative bacteria, the outer membrane is composed of an asymmetric lipid bilayer of 
phopspholipids and lipopolysaccharides, and the transmembrane proteins that reside in this membrane are almost 
exclusively (3-barrel proteins. These proteins are inserted into the membrane by a highly conserved and essential 
machinery, the BAM complex. It recognizes its substrates, unfolded outer membrane proteins (OMPs), through a 
C-terminal motif that has been speculated to be species-specific, based on theoretical and experimental results 
from only two species, Escherichia coli and Neisseria meningitidis, where it was shown on the basis of individual 
sequences and motifs that OMPs from the one cannot easily be over expressed in the other, unless the C-terminal 
motif was adapted. In order to determine whether this species specificity is a general phenomenon, we undertook 
a large-scale bioinformatics study on all predicted OMPs from 437 fully sequenced proteobacterial strains. 

Results: We were able to verify the incompatibility reported between Escherichia coli and Neisseria meningitidis, 
using clustering techniques based on the pairwise Hellinger distance between sequence spaces for the C-terminal 
motifs of individual organisms. We noticed that the amino acid position reported to be responsible for this 
incompatibility between Escherichia coli and Neisseria meningitidis does not play a major role for determining 
species specificity of OMP recognition by the BAM complex. Instead, we found that the signal is more diffuse, and 
that for most organism pairs, the difference between the signals is hard to detect. Notable exceptions are the 
Neisseriales, and Helicobacter spp. For both of these organism groups, we describe the specific sequence 
requirements that are at the basis of the observed difference. 

Conclusions: Based on the finding that the differences between the recognition motifs of almost all organisms are 
small, we assume that heterologous overexpression of almost all OMPs should be feasible in £ coli and other 
Gram-negative bacterial model organisms. This is relevant especially for biotechnology applications, where 
recombinant OMPs are used e.g. for the development of vaccines. For the species in which the motif is significantly 
different, we identify the residues mainly responsible for this difference that can now be changed in heterologous 
expression experiments to yield functional proteins. 

Keywords: Outer membrane (3-barrel protein biogenesis, Clustering, Hellinger distance, CLANS, Species specificity, 
Short linear motifs, GLAM2, C-terminal (3-strand, BamA, (3-barrel assembly machinery, Gram-negative bacteria, Outer 
membrane, Principal component analysis, Frequency plots 
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Background 

In Gram-negative bacteria, the cytoplasm is surrounded 
by inner membrane (IM) and outer membrane (OM), 
which are separated by an inter-membrane space, called 
the periplasm. Most of the newly synthesized proteome 
remains in the cytoplasm, but in addition, different 
machineries are involved in the translocation of non- 
cytoplasmic proteins to different subcellular localiza- 
tions, including the inner or outer membrane, the 
periplasmic space, or the extracellular space. Some of 
these machineries recognize their substrate proteins by 
an N-terminal signal peptide (SP) for the translocation 
process, while other machineries are SP-independent. 
The IM, which is a phospholipid lipid bilayer, is mostly 
occupied by transmembrane a-helical proteins, by 
inner membrane lipoproteins on its periplasmic side, 
and by other membrane associated proteins on both 
sides of the membrane. In contrast, the asymmetric 
OM, which consists of phospholipids only in the inner 
leaflet of the membrane and lipopolysaccharides in the 
outer leaflet, is mostly occupied by transmembrane 
(outer membrane) p-barrel proteins, and by outer 
membrane lipoproteins on its periplasmic side [1]. 

The biogenesis of an outer membrane p-barrel protein 
(OMP) begins with the translocation of the newly synthe- 
sized, unfolded protein across the IM into the periplasm 
via the Sec translocation machinery, which requires a 
cleavable general SP. Once the unfolded OMP reaches the 
periplasm, it uses the SurA or Skp-DegP pathway to reach 
the OM. SurA, Skp and DegP are periplasmic chaperones, 
which interact with unfolded OMPs by protecting them 
from aggregation and thus help them to reach the OM 
[2,3]. It has been shown that the SurA pathway and the 
Skp/DegP pathway can work in parallel, but that the SurA 
pathway plays an important role when the cell is under 
normal growth conditions, while under stress conditions, 
the Skp-DegP pathway plays the major role [4,5]. 

Once periplasmic chaperones deliver the OMPs to the 
OM, the folding and insertion of the protein into the 
membrane is mediated by the p-barrel assembly machin- 
ery (BAM), without an external energy source [6] such 
as ATP or ion gradients. This machinery involves an es- 
sential multi-domain protein, BamA (Omp85), which 
consists of a 16-stranded transmembrane p-barrel do- 
main, and of a large periplasmic part that consists of five 
POTRA (polypeptide transport-associated) domains. 
BamA is highly conserved in Gram-negative bacteria 
and also has homologues in mitochondria (Sam50) and 
chloroplasts (Toc75-V) [2]. In addition, the BAM com- 
plex, at least in E. coli, consists of four lipoproteins, 
BamB, BamC, BamD and BamE, among which only 
BamD is essential and conserved in most Gram-negative 
bacteria [2], Recent HMM-based sequence analysis by 
Anwari et al [7] showed that BamB and BamE are 



mainly present in a-, p- and y-proteobacteria, while 
BamC is present only in p- and y-proteobacteria. They 
also found a new lipoprotein subunit in the BAM com- 
plex, named BamF, which is present exclusively in a- 
proteobacteria.The BAM complex recognizes OMPs as 
its substrates via binding to an amphipathic C-terminal 
p-strand of the unfolded p-barrel [8], but the exact 
binding mode is still not clear. It was suggested that 
C-terminal p-strand binds to BamD [9], once the 
unfolded OMPs are delivered to the BAM complex by 
periplasmic chaperones. But a recent BamC and BamD 
subcomplex crystal structure shows that the unstruc- 
tured N-terminus of BamC binds to the proposed 
substrate binding site of BamD [4]. The C-terminal 
p-strand of an OMP p-barrel domain typically con- 
tains an aromatic residue at its C-terminus. It has been 
reported that deletion or substitution of this C-terminal 
residue negatively affects the biogenesis of OMPs [10,11]. 
Also, in vitro studies showed that the E. coli OM porin 
PhoE, when lacking its C-terminal Phe residue, fails to 
open the Omp85/BamA channel [8]. In both studies, 
overexpression of the mutant OMP was lethal to the 
cells. At lower concentration, the mutant protein was 
tolerated and got inserted into the membrane. This leads 
to the suggestion that a weak insertion signal other than 
the C-terminal residue or P-strand is present [8]. 

Robert et al [8] observed that the N. meningitidis OM 
porin PorA or its C-terminal p-strand did not open the 
E. coli Omp85/BamA channel, and the comparison of 
the C-terminal p-strands from N. meningitidis and E. 
coli OMPs showed a high preference of positive amino 
acids at the penultimate (+2) position in neisserial 
OMPs. When they mutated E. coli PhoE or its C- 
terminal p-strand, changing Gin for Lys at the +2 pos- 
ition, it did not open the channel any more; in contrast, 
a Neisseria PorA peptide with Gin instead of Lys 
increased the channel activity considerably. These stud- 
ies and the fact that high concentrations of neisserial 
OMPs were lethal in E. coli cells, lead to the conclusion 
that the C-terminal insertion signal is species-specific 
and that the residues at the +2 position were important 
for this phenomenon. The number of peptides/proteins 
used in the comparison in the study [8] was very low, 
compared to the total number of OMPs present in the 
E. coli or N. meningitidis genomes; moreover, the 
phenomenon was only compared between two organ- 
isms, one p- and one y-proteobacterial species. Since 
neisserial OMPs could be expressed in E. coli at low ex- 
pression rates, either the neisserial C-terminal insertion 
signal is weakly recognized by E. coli BAM complex, or 
other p-strands in the full length protein might act as a 
weak insertion signal. 

Thus, there seems to be at least some overlap in the 
peptide recognition. The intention of this study was to 
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use computational methods to quantify this overlap, and 
to find out whether the observed (partial) species specifi- 
city of the insertion signal is exhibited by all Gram- 
negative bacterial organisms. 

Results and discussion 

We identified 22,447 OMPs from 437 Gram-negative bac- 
teria using PSORTb [12], CELLO [13] and HHomp [14] as 
described in the methods section. These OMPs can be 
classified into different outer membrane protein (OMP) 
classes/families based on their function and the number of 
p-strands present in them, as these two features are usu- 
ally coupled [14-17]. We used HHomp [14] to classify the 
proteins into different OMP families. A brief summary of 
the OMP classification obtained from HHomp [14] for 
our data set is shown in Table 1. We then used ProfTMB 
[18] and PSIPRED [19] annotations to identify and extract 
the C-terminal p-strands from the OMPs. To evaluate the 
phenomenon of species specificity, we initially tried to 
cluster the C-terminal p-strands using different methods, 
such as sequence based clustering in CLANS [20] and 
organism-specific PSSM profile-based hierarchical cluster- 
ing. Since the sequences were highly similar and very 
short, the results obtained from these methods were not 
helpful to our analysis. We then used chemical descriptors 
and represented each amino acid in the peptides by five- 
dimensional vectors, thus representing each 10-residue 
peptide as a 50-dimensional vector. Next, we used dimen- 
sionality reduction techniques (principal component ana- 
lysis) to reduce the dimensions to 12 (the lowest number 
of dimensions that still contains most of the difference in- 
formation, see Methods). We then used all peptide vectors 
from an organism to derive a multivariate Gaussian distri- 
bution, which we describe as the peptide sequence space' 
of the organism. The overlap between these multidimen- 
sional peptide sequence spaces (multivariate Gaussian dis- 
tributions) was calculated using a statistical theory 



method, the Hellinger distance. As described in the meth- 
ods section, the pairwise overlaps between organism se- 
quence spaces were used to cluster them in CLANS [20] . 

Clustering of organisms based on C-terminal P-strands 

The pairwise comparison of the overlap between sequence 
spaces should help us to predict the similarity between the 
C-terminal insertion signal peptides, and how high the 
probability is that the protein of one organism can be 
recognized by the insertion machinery of another organ- 
ism. When there is a complete overlap of sequence space 
between two organisms, we assume that all C-terminal in- 
sertion signals from one organism will be recognized and 
functionally expressed by another organisms BAM com- 
plex and vice-versa. When there is only little overlap be- 
tween the sequence spaces of two organisms, we assume 
that only a small number of C-terminal insertion signals 
from one organism will be recognized by another organ- 
isms BAM complex. When there is no overlap, we assume 
that there is a general incompatibility. 

As described in the methods section, we examined the 
overlap of peptide sequence spaces between 437 Gram- 
negative bacterial organisms and used the pairwise overlap 
measurement to cluster the organisms. Since the C- 
terminal p-strands are highly conserved between all OMPs 
[21], it was very difficult to select a particular cut-off for 
the distance measure. Thus, the clustering was carried out 
using all the distance measures obtained from the calcula- 
tions. In the resulting 2D cluster map (Figure 1A), each 
node is one out of the 437 organisms, and they are colored 
based on the taxonomic classes (see the figure legend). 
During clustering with default clustering parameters in 
CLANS [20], the organisms tended to collapse into a sin- 
gle point, which illustrates that there is large overlap be- 
tween the peptide sequence spaces. Thus, we introduced 
very high repulsion values and minimum attraction values 
in CLANS [20] during clustering. With these settings the 



Table 1 Dataset classified based on OMP class 



OMP # of (3- Total # OMP class found in # of organisms in different proteobacteria class 
class strands of — 
peptides 



Function/Protein family 



OMP.8 


8 


2300 


71 


77 


227 


24 


10 


Membrane anchors [15] 


OMP.10 


10 


95 


5 


2 


66 


2 


2 


Bacterial proteases [16] 


OMP.12 


12 


1550 


60 


75 


212 


18 


10 


Integral membrane enzymes [15] 


OMP.14 


14 


572 


47 


38 


221 


20 


22 


Long chain fatty acid transporter [17] 


OMP.16 


16 


2477 


41 


86 


210 


23 


8 


General porins [15] 


OMP.18 


18 


327 


2 


14 


134 


7 


1 


Substrate specific porins [15] 


OMP.22 


22 


7462 


71 


86 


231 


25 


23 


TonB-dependent receptors [15] 


OMP.nn 


Not known 


7591 


71 


86 


231 


26 


23 




OMP.hypo 


Not known 


73 


2 


18 


33 


9 


1 





The OMP class of a protein was predicted by HHomp [14]. HHOmp defines the class based on the similarity to a closely related OMP structure. When HHomp 
cannot find a related structure, it classifies the proteins in OMP.nn. OMP.hypo proteins are hypothetical proteins [14]. 
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• Organism with < 30 peptides 

• Organism with > 31 and < 40 peptides 
o Organism with > 41 and < 60 peptides 

• Organism with > 61 and < 80 peptides 

• Organism with > 80 peptides 



Figure 1 Cluster map based on 437 sequenced Gram-negative organisms. In the cluster map each node represents one organism. The 
Hellinger distance was used to calculate the pairwise overlap between the multi-dimensional peptide sequence spaces of organisms. The 
calculated similarity or overlap was used to cluster the organism in CLANS. Figure 1A is colored by taxonomic class and Figure 1B is colored by 
the number of peptides in each organism. 



organisms formed a central big cluster, but separated 
crudely according to their taxonomic classes. We repeated 
the clustering multiple times to ensure that this separation 
is reproducible. In the cluster map (Figure 1A), p- and y- 
Proteobacteria form two sub-clusters, separated by the a- 
Proteobacteria. The very few 5-Proteobacteria in our data 
set cluster in the periphery of the y-proteobacterial cluster. 
In the cluster map, E. coli strains cluster along with other 
y- Proteobacteria. Even though Neisseria species cluster 
along with other p-Proteobacteria, they form a sub-cluster 
and are found in the periphery of the p-proteobacterial 
cluster. Note also that in this map, Helicobacter species 
form a distinct cluster well separated from the rest of the 
organisms. This core cluster includes H pylori strains, H 
acinonychis and H. felis, but not H hepaticus and H mus- 
telae species. The remaining e-proteobacteria species are 
scattered in the periphery of the cluster map. The distinct 



cluster formed by most Helicobacter species demonstrates 
that the sequence spaces of Helicobacter species are sig- 
nificantly different from rest of the organisms. The neis- 
serial cluster had only very few strong connections even 
with other p-proteobacterial organisms, which means the 
overlap or similarity of peptide sequence space between 
Neisseriales with rest of the p-Proteobacteria is compara- 
tively low. When we used stringent thresholds for the dis- 
tance measure, we noticed that the Neisseria and 
Helicobacter clusters started to move even further away 
from the center of the cluster map. 

Control experiments for clustering: randomly shuffled 
peptide sequences lose the signal for clustering 

We noticed that the organisms seen at the periphery of 
the cluster map had a lower overall number of peptides, 
while organisms with more peptides are typically seen at 
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the center of the circle. The cluster map in Figure IB is 
colored based on the number of extracted peptides per 
organism. In Figure IB, there are 99 organisms which 
have < 30 peptides (colored in pink), 77 organisms with 
31 to 40 peptides (colored in blue), 136 organisms 
with 41 to 60 peptides (colored in green), 66 organisms 
with 61 to 80 peptides (colored in red), and 59 organisms 
with more than 80 peptides (colored in brown). Even 
though H. pylori strains have a comparably high number 
of peptides (43 to 51 peptides), they still form a separate 
cluster in the periphery of the cluster map; therefore 
there must be an underlying organism-specific signal 
from the contributing peptides at least in this case. 

To confirm the presence of the organism-specific sig- 
nal, we took peptides from all the organisms and 
shuffled the positions of their amino acids randomly, 
and derived a new similarity matrix as mentioned in the 
method section which we clustered in CLANS [20]. 
Figure 2A shows the results from this test, where one 
can notice the taxonomic specific separations were com- 
pletely lost. The cluster map in Figure 2B, colored based 
on the abundance of OMPs in an organism, shows that 
organisms with more peptides are in the center, and 
organisms with fewer peptides move to the outer rim of 
the cluster map. This test confirms that the there is a 
species-specific signal for which the position of the indi- 
vidual amino acids is important; this is lost when the 
residues in the peptides are shuffled randomly. 

High preference of positively charged residues at the +2 
position in Neisseria species 

The comparison of the C-terminal peptide sequences in 
the p-barrel of selected OMPs of E. coli and N. meningiti- 
dis peptides by Robert el al [8] showed a strong preference 
for positively charged amino acids (Arg and Lys) at the +2 
position in neisserial OMPs, which led to the suggestion 
of a distinct species specificity of the C-terminal p-strand 



recognition. Since the comparison was made from 11 and 
9 OMPs from E. coli and N. meningitidis, respectively, we 
wanted to confirm this with a larger set of OMPs from the 
same bacterial species. The frequency plots in Figure 3A 
and B were created from 171 (E. coli) and 50 {^.meningiti- 
dis) unique C-terminal p-strands. Comparison between 
these plots demonstrates the high preference of Arg and 
Lys at the +2 position in neisserial OMPs. When we 
checked the frequency of amino acids at the +2 position 
for 22,447 peptides from all 437 organisms, we noticed 
that in the complete dataset, Arg and Lys are the top two 
preferred residues at the +2 position, and that they are 
present in 31.62% (3996 + 3102) of the peptides. A similar 
frequency of Arg and Lys (31.32% (2262 + 1794 out of 
12,949 unique peptides)) is observed when only taking 
unique peptides into account (i.e. when duplicates are 
removed from the database). Figure 4 shows the percent- 
age of Arg and Lys at the +2 position in 437 organisms; in 
this plot, Neisseria strains stand apart even from other 
p-proteobacterial organisms, and also from all other 
proteobacterial organisms. Neisseria strains (and a few 
a-proteobacterial organisms) have more than 60% of pep- 
tides with positively charged residues at the +2 position. 
Note, though, that also in all other organisms, positive 
charges are abundant there; for example, different Escheri- 
chia strains also have 25-40% of peptides with Arg and 
Lys at the +2 position. Thus, when these proteins are 
expressed, the Escherichia BAM complex should be able 
to recognize proteins with positively charged residues at 
+2 positions. As a matter of fact, there is experimental evi- 
dence for the functional expression of OMPs with posi- 
tively charged residues at the +2 position in E. coli [22]. 

High preference of Histidine at the +3 position in porins 
(16-stranded OMPs) from p-proteobacteria 

In the frequency plots (Figure 5) generated for each 
taxonomic class of Proteobacteria, we observed that the 



B 





Figure 2 CLANS cluster map of randomly shuffled peptides from 437 organisms. Figure 2A is colored by taxonomic class and Figure 2B is 
colored by the number of peptides in an organism. Colors are similar to Figure 1. 
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t 



+2 position 



+2 position 



Figure 3 Frequency plots derived from unique C-terminal insertion signal peptides for Escherichia (Figure 3A) and Neisseria (Figure 3B) 
strains. Frequency plots were made from 188 unique peptides of 31 Escherichia strains and 50 unique peptides of 7 Neisseria strains. The +2 
position is indicated by the arrow in the figure. Escherichia strains (Figure 3A) have no strong preference for any amino acid at the +2 position, 
whereas Neisseria strains (Figure 3B) have a strong preference for positively charged amino acids (Arg and Lys) at the +2 position. Hydrophobic 
residues are colored in blue and polar residues are colored in red. 



frequency of amino acids in the +2 positions were com- 
parable, with the possible exception of the Neisseriae. In 
contrast to that, we observed a prevalence (up to 57% 
frequency) of His at the +3 position for (3-proteobacteria, 
while the other taxonomic classes shared a similar, low 



(<15%) frequency of His in that position (Figure 6). 80% 
of the peptides with His at the +3 position belong to the 
(3-proteobacteria and more than 92% of these peptides 
stem from 16-stranded p-barrel proteins (Porins, 
denoted as the OMP.16 class by HHOmp). None of the 
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Figure 4 Percentage of Arg and Lys at +2 positions. We calculated the percentage of Arg and Lys residues at the +2 position from all unique 
peptides from the 437 organisms; color is based on taxonomic class. The Neisseria strains show a high preference for positively charged amino 
acids at the +2 position compared to other organisms. 
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Figure 5 Frequency plots of C-terminal (3-strands from Proteobacteria. Frequency plots generated from unique peptides of a-proteobacteria 

are shown in Figure 5A, of (3-Proteobacteria in Figure 5B, of y-Proteobacteria in Figure 5C, of 6-Proteobacteria in Figure 5D and of 

e-Proteobacteria in Figure 5E. The frequency plots are overall very similar; an exception is the high frequency of His at the +3 position in 

(3-Proteobacteria and of Tyr at the +5 position in e-Proteobacteria. 
k J 



Escherichia C-terminal p-strands in our database have 
His at the +3 position, and experiments by Robert et al. 
were done with a Neisseria PorA peptide with a His at 
the +3 position. This might be the true reason why E. 
coli BamA didn't recognize neisserial peptides. When we 
further examined the available structures of porins from 
Neisseria, and we found the His at the +3 position to be 
present in the trimerization interface of the porins. Since 
the vast majority of the His residues at the +3 position 
of the C-terminal motifs were from 16-stranded porins 
that typically trimerize, this position might be relevant 
for trimerization in neisserial porins. 

High preference of Tyrosine at the +5 position in 
Helicobacter species 

The separate cluster formed by Helicobacter species was 
an interesting observation for us, because it forms a 
more distinct cluster than Neisseria. This means that the 
peptide sequence space of Helicobacter species is more 
different from the rest of the organisms than even the 



one of Neisseriales. But the frequency plots (Figure 7 A 
and B), generated from unique peptides of all Helicobac- 
ter species and H pylori strains respectively, did not 
show a strong preference for any amino acid at either 
the +2 position and the strong preference of Tyr at +3 
position is common among the c-terminal insertion sig- 
nals. But, we noticed an uncommon strong preference of 
Tyr at the +5 position. The presence of a hydrophobic 
residue is common at +5 positions, but the presence of 
aromatic hydrophobic amino acids (especially Tyr) at 
the +5 position are highly preferred in H pylori strains 
compared to other organisms (Figure 8 A and B). Since 
the peptide sequence space depends upon the entire se- 
quence, we cannot confirm that the separate cluster 
formed by the H pylori is exclusively due to the residues 
at this one particular position. There is experimental evi- 
dence that the expression of various H pylori OMPs in 
E. coli is problematic [23]. Fisher et al noticed that as 
long as the expressed H pylori OMP remains in the 
cytoplasm of E. coli, it is not lethal, but that once it is 
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Figure 6 Frequency of His at the +3 position. The percentage of His at +3 was calculated from all unique peptides from 437 organisms. A 
high preference for His at +3 is observed for 16-stranded OMPs of (3-Proteobacteria. Since there is a high number of 16-stranded OMPs in 
Burkholderia strains (see Additional file 1 and Additional file 2), they were also annotated in the plot. 



secreted to the periplasm by the Sec machinery, it 
becomes lethal to E. coli. They also mentioned - without 
showing data - that removal of the C-terminal (3-barrel 
region resulted in toleration of the proteins in the peri- 
plasmic space. This probably means that the E. coli 
BAM complex didn't recognize the C-terminal p-strands 




+5 position 



Figure 7 Frequency plot of unique C-terminal P-strands from 
Helicobacter species. 163 unique C-terminal insertion signals from 
14 Helicobacter strains were used to generate this plot. The +5 
position which has the strong preference of Tyr is marked with the 
arrow. 



of the H. pylori OMPs, and the subsequent aggregation 
of the OMPs in the periplasm and the blockage of the 
BAM complex lead to the lethality. The authors con- 
cluded that the difference in OM lipid composition of 
Helicobacter, which contains cholesteryl glycosides [24], 
might have imposed some structural constraints on the 
OMP structure, and that this structural change is not 
tolerated by other organisms resulting in the observed 
lethality of such constructs. 

OMP class-specific and taxonomy class-specific signals 

We noticed that in some organisms, certain OMP classes 
of proteins are over-represented (see figure in Additional 
file 1). Examples are the prevalence of 16-stranded (3- 
barrels in the genomes of some (3-proteobacteria and 
22-stranded p-barrels in the genomes of some a- 
proteobacteria (see Additional file 2). Moreover, of the 
22,447 sequences in the data set, 33.82% (7591) 
sequences were annotated as OMP.nn by HHomp [14], 
which means there was no closely related homolog of 
known structure found for these proteins and thus, the 
number of p-strands in them is unknown. Thus, it is 
not possible to filter the dataset based on OMP class 
alone. But, as a control, we removed one OMP class at 
a time from the dataset and checked for differences in 
the clustering. When removing OMP.8 (Figure 9A) and 
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Figure 8 The percentage of Tyr (Figure 8A) and aromatic hydrophobic amino acids (Figure 8B) at the +5 position. For Figure 8A, we 
calculated the percentage of Tyr at the +5 position from all unique peptides from 437 organisms and for Figure 8B, we calculated the frequency 
of Tyr, Phe and Trp at the +5 position from all unique peptides from 437 organisms. In both plots Helicobacter strains shows a high preference of 
Tyr and aromatic amino acids at the +5 position. 



OMP.12 (Figure 9B), two OMP classes that are not 
overrepresented in any of the taxonomy classes; this did 
not visibly affect the clustering. But when we removed 
the OMP. 16 (Figure 9C) or the OMP.22 (Figure 9D) 
class, which have a high prevalence in p-proteobacteria 
and a-proteobacteria, respectively, this changed the 
clustering behavior of the respective taxonomic classes 
significantly; the organisms got scattered away from 
their position in the cluster compared to the situation 
in Figure 1A. This shows that the over-representation 
of certain OMP classes can influence the peptide sequence 
space, but since the proteins from over-represented 
OMP classes still contribute to the real sequence space 
of the organisms, we decided not to correct for this 
effect and used all peptides from the organisms in our 
experiments. 

We also examined whether there is a more general sig- 
nal from OMP classes, other than the signal from the 
over-representation of an individual OMP class that 
would influence the observed organism-specific signal. 
For this, we separated the peptides from an organism 
based on the OMP classification and selected the entities 
which had more than five unique peptides for further 
analysis. From this, we created two data sets of entities; 
one data set containing organisms from all taxonomic 
classes, but with C-terminal insertion signals only from 
22-stranded OMPs, and a second data set containing 
organisms only from y-proteobacteria, but in which 



individual organisms were split into multiple entities, 
each representing an OMP class that contained more 
than five unique C-terminal insertion signals. We clus- 
tered these data sets separately and the resulting cluster 
maps are shown in Figure 10A and B. In the cluster map 
in Figure 10A, each node is an organism, but only the 
C-terminal insertion signals from 22-stranded OMP 
class were considered for the clustering. In this cluster 
map, all the organisms clustered based on their taxo- 
nomic classes. In the cluster map in Figure 10B, all 
organisms are from y-proteobacteria, but organisms with 
multiple OMP classes with more than five unique C- 
terminal insertion signals per class will result in multiple 
representative nodes. These nodes which belong to dif- 
ferent OMP classes clustered based on the OMP classes. 
This confirms that there are independent contributions 
to the overall signal, from both the OMP classes and 
from taxonomy. Within one OMP class, there still is di- 
vergence in accordance with different taxonomic classes; 
but overrepresentation of a single OMP class in an or- 
ganism influences the average motif of an organism. 

Conclusion 

In our study, we were able to reproduce the difference 
between E. coli and Neisseria C-terminal (3-strands as 
found by Robert et al, which suggests a species-specific 
insertion signal for OMPs. But in contrast to the earlier 
report, we show that positively charged amino acids at 
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Figure 9 Control experiments to show the influence of overrepresented OMP classes. OMP classes 0MP.8 (Figure 9A), OMP. 12 (Figure 9B), 

0MP.16 (Figure 9C) and OMP.22 (Figure 9D) were removed and only organisms with more than 20 unique peptides were used in the clustering. 

Peptides belonging to OMP.nn and OMP. hypo (OMPs with unknown strand number and function) were not removed from the data set during 

the control experiments. Color legends are similar to the Figure 1 A. 
I J 



the +2 position can not be the reason for the experimen- 
tally observed species specificity between these organ- 
isms, as Escherichia also contains C-terminal (3-strands 
with positively charged amino acids at the +2 position. 
Moreover, there is experimental evidence which shows 
the functional expression of a heterologous OMP, YadA 
of Yersinia enter ocolitica, with a positively charged 
amino acids at the +2 position, in E. coli [22]. The neis- 
serial PorA protein and the neisserial C-terminal |3- 
strands used by Robert et al contained His at the +3 
position, which is common for many OMP. 16 proteins 
from (3-proteobacteria and is not found in Escherichia 
OMPs; this might be the true difference in the recogni- 
tion of C-terminal (3-strands by the Escherichia BAM 
complex. Furthermore we found that Helicobacter 
strains form a distinct cluster in the cluster map, which 
is due to their very different composition of C-terminal 
(3-strands. There is experimental evidence showing that 
expression of H pylori OMPs in E. coli is lethal, and that 
this lethality can be suppressed by removing the C- 
terminal strand. When we looked at the frequency 



motifs from Helicobacter strains we did not notice a 
strong preference of any amino acid at the +2 or the +3 
position, however we observed a strong preference of 
Tyr at the +5 position, which is not common in Escheri- 
chia or other Proteobacteria. We assume that this pos- 
ition may play an important role in the rejection of these 
C-terminal |3-strands by the E. coli BAM complex. The 
examples of Neisseria and Helicobacter show that differ- 
ent positions in the C-terminal recognition motif can be 
relevant for heterologous expression of OMPs. We pre- 
dict that in certain group of species the highly preferred 
residues in certain positions of the C-terminal insertion 
signals are responsible for the inadequate recognition of 
the C-terminal insertion signals by the E. coli BAM 
complex. In the future, mutation studies will have to be 
performed to prove the importance of these residues in 
the recognition step in the OMPs biogenesis. 

As a result of our study, we have shown that there is a 
large overlap between the signals from C-terminal insertion 
peptides of different organisms, which suggests that in most 
cases, heterologous expression should be possible. OMPs 
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Figure 10 CLANS cluster map of OMP-Organism class based entities. In figure 10A and figure 1 0B, each node is a representative of OMP- 
Organism entities that have more than five unique peptides of a single OMP class from an individual organism. In Figure 10A, entities are only 
from the OMP.22 class, which includes entities from all proteobacterial taxonomic classes. In Figure 1 0B, entities are only from y-Proteobacteria 
and include different OMP classes. 



can fold in vitro even without the help of any other proteins 
[25]. The BAM complex is an enzyme that makes the fold- 
ing of OMPs into the outer membrane more efficient by in- 
creasing the reaction rate of a natural process. Enzymes 
modify reaction rates by changing the reaction route to 
lower the activation energy, and binding/recognition is part 
of this changed route. Thus, it is also important to consider 
expression rates: poor recognition might still lead to prop- 
erly folded OMPs in the outer membrane of a heterologous 
host at low expression rates. But under overexpression con- 
ditions, the BAM machinery can probably not cope with 
poorly recognized signals that would lead to lower overall 
folding rates (considering that recognition is the first and 
probably in some cases rate-limiting step of the folding 
process). Different classes of OMPs have different folding 
rates, where small OMPs fold faster and more efficiently 
(again in vitro) than larger ones, which might explain why 
large OMPs seem to depend more heavily on an intact 
BAM machinery than small ones [26,27]. 

Since there are two different signals that contribute to 
the observed average motifs, from OMP class and from 



taxonomy, it is problematic to use averaged motifs or se- 
quence logos to determine the compatibility of a given 
protein-organism pair. The main problem here is the over- 
representation of certain OMP classes in some organism 
groups; this overrepresentation shifts the average signals. 
It is more useful to determine for an individual C-terminal 
motif form a protein to be expressed, whether it is also 
present in any of the OMPs of the host organism. 

The taxonomy-based specificity we observed here 
based on sequence space depends upon the entire pep- 
tide sequence, but at the functional level, these peptides 
are recognized based on the interacting residue positions 
in the C-terminal insertion signal peptide. The PDZ do- 
main of the bacterial periplasmic stress sensor, DegS, 
also recognizes the C-terminal YxF motif in the last (3 
strand of misfolded OMPs. This leads to the activation 
of the proteolytic pathway and the expression of DegP, 
which degrades misfolded OMPs [28,29]. Since the C- 
terminal p-strand is recognized by both the PDZ domain 
of the DegS protein and by the BAM complex, studying 
the co-evolution of interacting residues in both cases 
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would help in understanding the divergence of the C- 
terminal p-strands between different Gram-negative bac- 
terial organisms. Unfortunately, co-crystal structures of 
the BAM complex with its substrates are not available 
yet. With more experimental evidence about the sub- 
strate recognition sites for the C-terminal insertion sig- 
nal peptide in the BAM complex, the co-evolution of 
the interacting amino acids can hopefully be studied in 
the future, which may shed more light on into the evolu- 
tion of the BAM machinery in different Proteobacteria, 
and on its ability to recognize heterologous substrates 
for biotechnology applications. 

Methods 

Predicting outer membrane |3-barrel proteins 

In a previous study [30] to annotate the subcellular loca- 
lizations (SCLs) for the proteomes of 607 Gram-negative 
bacteria, we developed the program/database ClubSub-P, 
in which we used programs like CELLO [13], PSORTb 
[12] and HHomp [14] to annotate OMPs. CELLO [13] 
and PSORTb [12] use support vector classifiers to anno- 
tate different SCLs of query sequences and are much 
faster than HHomp [14] which uses HMM-HMM-based 
search algorithms to predict and classify OMPs. Thus 
we used CELLO and PSORTb to scan all the sequences 
in the clusters of the ClubSub-P database. A random 
protein was selected from a cluster where CELLO or 
PSORTb had a positive hit for an outer membrane pro- 
tein, and the sequence was analyzed with HHomp. 
When HHomp predicted a protein with more than 90% 
probability to be an OMP, we considered all the proteins 
in the cluster to be OMPs. We in addition selected all 
singleton sequences with positive prediction from 
CELLO or PSORTb and analyzed them with HHomp. 

Finding the C-terminal P-strands 

HHomp annotates/classifies OMPs based on the number 
of p-stands present in them. HHomp calculates/predicts 
this from homologous structures of OMPs. We trans- 
ferred this annotation from the best hit in HHomp runs to 
the query sequences. HHomp also annotates secondary 
structure and p-barrel strand predictions using PSIPRED 
[19] and ProfTMB [18], which was used to extract the 
C-terminal (last) p-strand/motif for each OMP. The last 
p-strand predicted by ProfTMB [18] was extracted as 
the C-terminal motif from representative sequences and 
singletons, and further filters were applied to reduce the 
false positive rate; 1) 70% of the amino acids in the 
motif should have a p-strand prediction from PSIPRED 
[19], 2) If the C-terminal of the protein is more than 4 
residues away from the C-terminus of the motif, we 
extended the predicted motif by up to 4 amino acids to 
find an aromatic hydrophobic residue [F,Y,W], else we 
extended the C-terminus of the motif to the end of the 



protein itself. 3) Additionally, if the motif length was 
less than 10 residues, we extended the motif towards its 
N-terminus. 4) Furthermore with the regular expression. 

[ A C] [YFWKLHVITMADGRE] [ A C] [YF WKLHVITMAD 
GRE] [ A C] [YFWKLHVITMADGRE] [ A C].[ A C] [ YFWHILM] 
(an updated version of BOMP[31] C-terminal pattern), we 
searched for the existence of the alternating hydrophobic 
pattern in the motif which is typical for transmembrane 
p-strands. 

Using the information from this representative C- 
terminal motif, we extracted C-terminal motifs from the 
rest of the sequences in the clusters. We used MAFFT 
[32] to align the sequences from the cluster, and used 
the start and end coordinates of the C-terminal motif 
discovered above in the representative sequences ran- 
domly selected from the clusters. Motifs were extended 
on the both sides, in cases where we encountered gaps 
in the alignment. The gaps were removed and then 
resulting motifs were subjected to alternating hydropho- 
bic pattern matching. 

The peptides we collected vary in length from 10 to 21 
residues (only six of the peptides were longer than 21). 
We then applied GLAM2 [33], a gapped motif discovery 
algorithm, to find the strongest motif with a length of 10 
from this dataset. We found 24,626 motif instances in 
25,454 sequences, and only 232 motifs in this alignment 
had gaps. The gapped motifs were removed before fur- 
ther analysis. 20,135 of the motif instances were C- 
terminal to the protein itself (which means there were 
no additional domains at the C-terminal end of the p- 
barrel proteins). 437 organisms had more than 20 
unique C-terminal p-strands, ranging from 21 to 171 
peptides in different organisms. In total, the 437 organ- 
isms yielded 22,447 peptides, of which 12,949 are unique 
peptides. 

Sequence based clustering 

Since all of the peptides are 10 amino acids in length by 
default, we used the PAM30 substitution matrix for an 
all-against-all BLAST, with an E-value cut-off of 1000 
and used the pairwise P-values to cluster the sequences 
in CLANS [20]. 

PSSM profile-based hierarchical clustering 

The relative frequencies of the 20 amino acids were cal- 
culated for all 10 positions in the peptides from an or- 
ganism. To obtain odds scores, the relative frequencies 
were simply divided by each residue s background fre- 
quency, which was calculated by shuffling the amino 
acid sequence in all the peptides from all organisms, and 
log base 2 was applied to obtain a PSSM matrix. The 
20 x 10 PSSM matrices obtained for each organism were 
stored in a single 437x200 PSSM matrix, and correlation 
distances were calculated between each organism and 
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agglomerative hierarchical clustering (average method) 
was performed via the pvclust [34], which calculates two 
types of ^-values, AU (Approximately Unbiased) ^-value 
and BP (Bootstrap Probability) value to indicate the like- 
lihood of the cluster formation. 

Peptide sequence space-based clustering 
Chemical descriptors 

To generate a peptide sequence space, each amino acid 
in the peptide sequences was represented by five chem- 
ical descriptors that are the first five principal compo- 
nents derived from 26 physiochemical descriptor 
variables using dimensionality reduction techniques [35]. 
The initial 26 physiochemical descriptor variables in- 
clude the molecular weight, experimentally determined 
retention values from seven thin-layer chromatography 
runs, van der Waals volume of the side chain, three nu- 
clear magnetic resonance shift variables, log P, six vari- 
ables for semiempirical molecular orbitals, three 
variables for total, polar and nonpolar surface area, two 
variables for side chain charge and two variables for 
hydrogen bond donor and acceptor [35]. The five princi- 
pal components derived from these 26 variables contain 
the maximal variations in the data set and they can be 
interpreted as the size, polarizability, and the lipophilic, 
steric, and electronic properties of all the amino acids 
[35]. The amino acid descriptors were originally derived 
for use as design variables in peptide design, and in the 
construction of combinatorial libraries to effectively 
search chemical property space [35]. Here we used them 
to describe the space occupied by the C-terminal (3- 
strands and to measure how strongly peptide sequences 
of different organisms overlap. Using the chemical 
descriptors, each amino acid in the peptide was con- 
verted into a 5-dimensional vector; thereby, each lOaa 
peptide was represented as a 50-dimensional vector. 
Thus, the whole set of 22,447 peptides were converted 
to a 22,447 x 50 matrix. 

Principal component analysis 

Since the dimensionality of the data set (50) is larger 
than the sample size (minimum 21 peptides per organ- 
ism), the dimensionality of the peptide vectors had to be 
reduced below the sample size (i.e., below 21 in our 
dataset) for further statistical analysis [36]. Principal 
component analysis (PCA) is a mathematical technique 
to reduce the dimensionality of data sets, while retaining 



most of the variation in the data set. This is achieved by 
projecting the original data vectors along the directions 
of maximal variation, called principal components (PCs). 
The first PC captures the maximum variation; the vari- 
ation associated with consecutive PCs decreases rapidly. 
Thus, the original data set can be mapped into a lower 
dimensional space by projecting the original data on 
those PCs representing most of the variation [36,37]. We 
used PCA to reduce the dimensionality of our peptide 
sequences (22,447 x 50 matrix) by projecting the 50 di- 
mensional chemical descriptor vectors onto the first 12 
principal components, which represent 69.05% of the 
total variation in the data. We thereby obtained a 
22,447 x 12 matrix that did not suffer from any pro- 
blems in sample size. 

Multivariate Gaussian fitting and Hellinger distance 

Next, we fit a multivariate Gaussian distribution for each 
individual organism by calculating a 12-dimensional 
mean vector and covariance matrix, (e.g., for E. coli 536 
which has 66 unique peptides, the Gaussian will be fitted 
based on a 66 x 12 matrix). 

The Euclidean distance between means of peptide se- 
quence spaces is not suitable for measuring the similar- 
ity between the C-terminal (3-strands of different 
organisms. Instead, the similarity measure should also 
represent how strongly their associated sequence spaces 
overlap. To achieve this we used the Hellinger distance 
between the fitted Gaussian distributions [38]. In statis- 
tical theory, the Hellinger distance measures the similar- 
ity between two probability distribution functions, by 
calculating the overlap between the distributions. For a 
better understanding, Figure 11 illustrates the difference 
between the Euclidean distance and the Hellinger dis- 
tance for one-dimensional Gaussian distributions. The 
Hellinger distance, D H (Org!,Org 2 ), between two distri- 
butions Orgl(x) and Org2(x) is symmetric and falls be- 
tween 0 and 1. D H (Org!, Org 2 ) is 0 when both 
distributions are identical; it is 1 if the distributions do 
not overlap [39] . Therefore we have for the squared Hel- 
linger distance Dj^Org!, Org 2 ) = 1 - overlap(Orgl, 
Org2). The following equation (1) was derived to calcu- 
late the pairwise Hellinger distance between the multi- 
variate Gaussian distributions, Org! and Org 2 , where Ui 
and u 2 are the mean vectors and Z 2 and Z 2 are the co- 
variance matrices of Org! and Org 2 , and d is the dimen- 
sion of the sequence space, i.e. d=12 
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Figure 11 Illustration of the difference between the Euclidean distance and the Hellinger distance for one-dimensional Gaussian 
distributions. Two Gaussian distributions are shown as black lines for different choices of u and o. The grey area indicates the overlap between 
both distributions. |u1-u2| is the Euclidean distance between the centers of the Gaussians, D H is the Hellinger distance (equation 1). Both values 
are indicated in the title of panels A-D. A: For u1 = u2 = 0, o1 = o2 = 1, the Euclidean distance and the Hellinger distance are both zero. B: For 
u1 = u2 = 0, o1 =1, o2 = 5 the Euclidean distance is zero, whereas the Hellinger distance is larger than zero because the distributions do not 
overlap perfectly (the second Gaussian is wider than the first). C: For u1 =0, u2 = 5, o1 = o2 = 1, the Euclidean distance is five, whereas the 
Hellinger distance almost attains its maximum because the distributions only overlap little. D: For u1 =0, u2 = 5, o1 =1, o2 =5, the Euclidean 
distance is still five as in C because the means did not change. However, the Hellinger distance is larger than in C because the second Gaussian 
is wider, which leads to a larger overlap between the distributions. 



CLANS 

Next, the Hellinger distance was used to define a dis- 
similarity matrix for all pairs of organisms. The dissimi- 
larity matrix was converted to P-values, which were then 
used as input in CLANS [20] to compute a cluster map 
showing all organisms. CLANS is a graph-based cluster- 
ing method that represents sequences as nodes. All 
nodes are connected by weighted edges where the pair- 
wise similarity between the sequences determines the 
strength of the weight [20]. In our study, individual 
organisms were considered as nodes and the weight of 
the edges connecting the nodes was based on the pair- 
wise Hellinger distance (pairwise overlap of sequence 
space) between the organisms. Hence stronger 



connections represent a larger overlap/similarity be- 
tween the peptide sequence spaces, while organisms 
with high divergence in their C-terminal motifs are only 
weakly connected or completely disconnected in the 
cluster map. Initially the nodes are randomly placed in a 
2D space and experience attraction forces according to 
how strongly they are connected with the other nodes. 
In an iterative refinement scheme, nodes move towards 
similar nodes with an attractive force proportional to the 
similarity between them. A small, overall repulsive force 
is applied to all pairs of nodes to keep them from col- 
lapsing into a single node. Since CLANS [20] uses non- 
deterministic dynamics, each run performed with the 
same dataset will result in a similar but not necessarily 
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identical clustering. Thus, multiple clustering runs were 
performed to check the reproducibility of the final clus- 
tering. Because initial tests showed that with the default 
attraction and repulsion values nodes (organisms) were 
collapsing, we used very small attraction values (up to 
0.1) and high repulsion values (up to 500) to avoid col- 
lapse of nodes and to obtain visually better clusters. 

Frequency plot 

The WebLogo [40] online tool was used to create the 
frequency plots, using custom colors. Only unique pep- 
tide sequences were used to generate all the frequency 
plots. The amino acid percentage plots were created 
using R version 2.13.1 [41]. 

Additional files 



Additional file 1: The figure shows the number the over 
representation of OMP.16 proteins among (3-proteobacteria and 
OMP.22 among a-proteobacteria. 

Additional file 2: The table lists the number of OMPs in an 
organism present in different OMP classes. 
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